Document Organization and Retrieval using Self Organizing Maps and Statistical Language Modeling
Abstract
In this paper we present a method for document organization and retrieval based on statistical language modeling. The proposed method, which is based on the vector model, uses nonlinear interpolation to provide more accurate statistical estimators of the conditional probabilities employed for encoding the context of each word. An information retrieval system is built using the self-organizing map algorithm. In the first step, the self-organizing architecture is used to cluster the feature vectors and to build clusters of semantically related words. Subsequently, the collection of documents is encoded into vectors and the same algorithm is used to cluster the documents into contextually related classes. The information retrieval system is queried using a sample document and the corresponding precision-recall curve is provided.
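The first step described in the abstract (encoding each word by the conditional probabilities of its context, then clustering the resulting vectors with a self-organizing map) can be illustrated with a minimal sketch. The abstract does not specify the interpolation scheme, so absolute discounting with a unigram back-off is used here purely as a stand-in. The function names `context_vectors`, `train_som`, and `best_matching_unit`, the toy corpus, and all parameter values are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def context_vectors(tokens, discount=0.5):
    """Encode each word w as the vector of smoothed P(next word | w).

    Assumption: absolute discounting interpolated with unigram
    probabilities stands in for the paper's unspecified scheme.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    vectors = {}
    for w in vocab:
        followers = {b: c for (a, b), c in bigrams.items() if a == w}
        n_w = sum(followers.values())
        vec = [0.0] * len(vocab)
        for v in vocab:
            p_uni = unigrams[v] / total
            if n_w == 0:
                vec[index[v]] = p_uni  # no observed successor: use unigram
            else:
                # subtract a constant from each seen count, then
                # redistribute the freed mass via the unigram model
                seen = max(followers.get(v, 0) - discount, 0.0) / n_w
                freed = discount * len(followers) / n_w
                vec[index[v]] = seen + freed * p_uni
        vectors[w] = vec
    return vocab, vectors


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                   # decaying learning rate
        sigma = max(sigma0 * (1 - frac), 0.5)   # shrinking neighborhood
        for x in rng.sample(data, len(data)):   # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy corpus (illustrative only, not data from the paper).
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, vectors = context_vectors(tokens)
som = train_som([vectors[w] for w in vocab], rows=2, cols=2)
word_map = {w: best_matching_unit(som, vectors[w]) for w in vocab}
```

In this toy corpus "cat" and "dog" are followed by the same word, so their context vectors coincide and they map to the same node. The second step of the abstract would reuse `train_som` on document vectors, for example histograms of the word clusters each document touches, though the paper's exact document encoding is not given here.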
Similar resources
An approach based on language modeling and neural networks
This thesis covers topics relevant to information organization and retrieval. The main objective of the work is to provide algorithms that can elevate the recall-precision performance of retrieval tasks in a wide range of applications ranging from document organization and retrieval to web-document pre-fetching and finally clustering of documents based on novel encoding techniques. The first pa...
Self-organizing Maps in Natural Language Processing
Kohonen's Self-Organizing Map (SOM) is one of the most popular artificial neural network algorithms. Word category maps are SOMs that have been organized according to word similarities, measured by the similarity of the short contexts of the words. Conceptually interrelated words tend to fall into the same or neighboring map nodes. Nodes may thus be viewed as word categories. Although no a prior...
A combination of Wilcoxon test and R-estimates for document organization and retrieval
The Wilcoxon signed-rank test is exploited for document organization and retrieval in this paper. A novel modeling method for documents and a distance metric between documents are proposed. Both document modeling and document comparisons are based on signed-ranks and are applied to the frequency of occurrence of the document bigrams. A metric using the Wilcoxon signed-rank test exploits these s...
Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps
Clustering and visualization of large text document collections aids in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred usin...
Indexing Audio Documents by using Latent Semantic Analysis and SOM
This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection an...